Abstract


Checkout the presentation at: https://youtu.be/cCZ3wZs-RkA Attrition is the voluntary loss of employees, and here at DDSAnalytics, we’re looking to determine the factors that lead to this. We determined that three primary facors are over time, montly income, and job roles. Due to this, predictors for monthly income were also analyzed. It was determined that traveling, total years worked, and job level were significant indicators. Fianlly, it was noticed that sales reps were of the highest percentage when it came to attrition, and mangers and research directors were amoung the most experienced and bringing in the highest monthly income.

Introduction


DDSAnalytics is a Fortune 100 company that specializes in talent development. A recent effort for their data science team is to determine predicting factors for turn over rates. The interest lies in the top three factors, and whether there are any trends to specific job roles. The following document shows the effort to determined these high interest variables. ## Data Description *** The dataset recieved contained roughly 33 different factors and over 850 observations. The factors ranged from job related attributes such as departments to personal affairs such as marriage. #### Read in the data and load in all the libraries

# load in the libraries
library(stringi)
library(plotly)
library(plyr)
library(class)
library(caret)
library(e1071)
library(MASS)
library(gridExtra)
library(grid)
library(ggplot2)
library(lattice)
library(tidyverse)
library(shinydashboard)
library(shiny)
library(glue)
library(reshape)
library(dplyr)
library(FNN) 
library(gmodels) 
library(psych)
require(ggiraph)
require(ggiraphExtra)
require(plyr)

# read in the dataframe
data = read.csv("../data_sets/CaseStudy2-data.csv",header = TRUE)

EDA - Exploratory Data Analysis


Clean Data

After checking the data for duplicates and null values, it was determined that the data was clean and no additional cleaning would be required.

# check for unique employees only
distinct_df<-distinct(data, EmployeeNumber)

# check for complete observations 
#df[complete.cases(data), ]

Explore Data

Set the datasets

categorical variables

Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus, OverTime, JobLevel, JobSatisfaction, JobInvolvement, PerformanceRating, RelationshipSatisfaction,StandardHours, StockOptionLevel, NumCompaniesWorked, TrainingTimesLastYear, WorkLifeBalance, Education, YearsInCurrentRole, EnvironmentSatisfaction

continuous variables

ID, Age, Attrition, DailyRate, DistanceFromHome, EmployeeNumber, HourlyRate, MonthlyIncome, MonthlyRate, PercentSalaryHike, TotalWorkingYears, YearsAtCompany, YearsSinceLastPromotion, YearsWithCurrManager

# create a dataframe with categorical, removing Over 18 and 
categorical <- subset(data, select = c(Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus, OverTime, JobLevel, JobSatisfaction, JobInvolvement, PerformanceRating, RelationshipSatisfaction, StockOptionLevel, NumCompaniesWorked, TrainingTimesLastYear, WorkLifeBalance, Education, YearsInCurrentRole, EnvironmentSatisfaction))

# create a dataframe with all continuous
continuous <- subset(data, select = c(ID, Age, Attrition, DailyRate, DistanceFromHome, EmployeeNumber, HourlyRate, MonthlyIncome, MonthlyRate, PercentSalaryHike, TotalWorkingYears, YearsAtCompany, YearsSinceLastPromotion, YearsWithCurrManager, StandardHours) )

Determine relationships to Attrition

Numerical relations

Reviewing the plots below, the numerical factors that appear to correlate with higher attrition rates is:

  • Monthly Income
  • Percent Salary Hike
  • Total Working Years
  • Years at the Company
  • Years Since Last Promotion

The were selected due to the clustering or correlation R2 of positive attrition.

# differentiate colors by attrition
my_cols <- c("#00AFBB", "#E7B800") 

df_attrition_no <- continuous %>% filter(Attrition == "No")
df_attrition_yes <- continuous %>% filter(Attrition == "Yes")

# Correlation panel
panel.cor <- function(x, y){
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- round(cor(x, y), digits=2)
    txt <- paste0("R = ", r)
    cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
# Customize upper panel
upper.panel<-function(x, y){
  points(x,y, pch = 1, col = my_cols[continuous$Attrition])
}

# Customize upper panel
upper.panel<-function(x, y){
  points(x,y, pch = 1, col = my_cols[df_attrition_yes$Attrition])
}
pairs(df_attrition_yes[,1:5], lower.panel = panel.cor, upper.panel = upper.panel) # age vs. business

pairs(df_attrition_yes[,5:10], lower.panel = panel.cor, upper.panel = upper.panel) # age vs. business

pairs(df_attrition_yes[,10:15], lower.panel = panel.cor, upper.panel = upper.panel) # age vs. business

Categorical Relations

It’s apparent that only specific variables are significant in terms of attrition. They variables are listed below: * Job Role Marital Status Overtime Job Level * Job Involvement Stock Option Level These variables has the most stark differences when it comes to the percentage differences between the variables.

It is interesting to note that gender, education, and job satisfaction didn’t have a major impact on attrition.

#compare the plots with attrition being the explainatory factor (Categorical)
data$AttritionF <- as.factor(data$Attrition)
categorical_variables = colnames(categorical)

graphing <- function(data, categorical_variables) {
  for (val in categorical_variables) {
    plot <- ggplot(data=data)+aes_string(val)+geom_bar(aes(fill=as.factor(AttritionF)), position="fill")+ ggtitle(glue("{val}"))
    print(plot)
  }
}
graphing(data, categorical_variables)

##### Test variables for relevance Pull the variables visually analyzed and test for relevance in the final model. Since we are testing for a categorical variable, KNN, is utlized.

ETL

Scaling the data to prevent loss of patterns that would not been seen if not completed. Extract the data that’s determined the most relevant from the analysis above.

# Scale the data

# scale the data
data$MonthlyIncomeScaled <- scale(data$MonthlyIncome)
data$PercentSalaryHikeScaled <- scale(data$PercentSalaryHike)
data$TotalWorkingYearsScaled <- scale(data$TotalWorkingYears)
data$YearsAtCompanyScaled <- scale(data$YearsAtCompany)
data$YearsSinceLastPromotionScaled <- scale(data$YearsSinceLastPromotion)

# grab all significant variables for testing
variables_graphed <- subset(data, select = c(ID, Attrition, JobRole, MaritalStatus, OverTime, JobLevel, JobInvolvement, StockOptionLevel, WorkLifeBalance, MonthlyIncome, PercentSalaryHike, TotalWorkingYears, YearsAtCompany, YearsSinceLastPromotion, MonthlyIncomeScaled, PercentSalaryHikeScaled, TotalWorkingYearsScaled, YearsAtCompanyScaled, YearsSinceLastPromotionScaled))

need_dummycode <- c("JobRole", "MaritalStatus", "OverTime")
for (val in need_dummycode) {
  temp <- as.data.frame(dummy.code(data[[val]]))
  variables_graphed <- cbind(variables_graphed, temp)
}

Its determined that K-15 results is the best KNN-model. * Monthly Salary * Overtime * Role

at K = 7, seed = 7

# cross validation knn
set.seed(7)
splitPerc = .80

#split data
trainIndices = sample(1:dim(variables_graphed)[1],round(splitPerc * dim(variables_graphed)[1]))
train = variables_graphed[trainIndices,]
test = variables_graphed[-trainIndices,]

# create dataframe to hold KNN results
accs = data.frame(accuracy = numeric(20), k = numeric(20))

# run KNN algorithm 
for(i in 1:20)
{
  classifications = knn(train[,c(32,33,15,20,21,22,23,24,25,26,27,28)],test[,c(32,33,15,20,21,22,23,24,25,26,27,28)],train$Attrition, prob = TRUE, k = i)
  #print(table(test$Attrition,classifications))
  tryCatch({ CM <- confusionMatrix(table(test$Attrition,classifications))
    accs$accuracy[i] = CM$overall[1]
    accs$specificity[i] = CM$byClass[2]
    accs$sensitivity[i] = CM$byClass[1]
    accs$k[i] = i}, 
    error = function(e) {
    accs$accuracy[i] = 0
    accs$specificity[i] = 0
    accs$sensitivity[i] = 0
    accs$k[i] = 0
    })
  #CM <- confusionMatrix(table(test$Attrition,classifications))
}
accs
##     accuracy  k specificity sensitivity
## 1  0.7528736  1   0.2222222   0.8913043
## 2  0.8620690  2   0.4615385   0.8944099
## 3  0.8448276  3   0.3888889   0.8974359
## 4  0.8678161  4   0.5000000   0.8902439
## 5  0.8563218  5   0.4375000   0.8987342
## 6  0.8735632  6   0.5384615   0.9006211
## 7  0.8563218  7   0.4375000   0.8987342
## 8  0.8850575  8   0.6363636   0.9018405
## 9  0.8735632  9   0.5384615   0.9006211
## 10 0.8850575 10   0.6363636   0.9018405
## 11 0.8793103 11   0.5833333   0.9012346
## 12 0.9022989 12   0.8750000   0.9036145
## 13 0.9022989 13   0.8750000   0.9036145
## 14 0.9022989 14   0.8750000   0.9036145
## 15 0.9022989 15   0.8750000   0.9036145
## 16 0.9022989 16   0.8750000   0.9036145
## 17 0.9022989 17   0.8750000   0.9036145
## 18 0.9022989 18   0.8750000   0.9036145
## 19 0.9022989 19   0.8750000   0.9036145
## 20 0.8908046 20   0.8333333   0.8928571
plot(accs$k,accs$accuracy, type = "l", xlab = "k", main="K Iterations and Accuracy") 

Test KNN against different seeds
accs1 = data.frame(accuracy = numeric(20), k = numeric(20))
# run KNN algorithm 
for(i in 1:20)
{
  # set seed
  set.seed(i)
  splitPerc = .80
  
  
  #split data
  trainIndices = sample(1:dim(variables_graphed)[1],round(splitPerc * dim(variables_graphed)[1]))
  train = variables_graphed[trainIndices,]
  test = variables_graphed[-trainIndices,]
# run KNN algorithm 

  classifications = knn(train[,c(32,33,15,20,21,22,23,24,25,26,27,28)],test[,c(32,33,15,20,21,22,23,24,25,26,27,28)],train$Attrition, prob = TRUE, k = 7)
  #print(table(test$Attrition,classifications))
  tryCatch({ CM <- confusionMatrix(table(test$Attrition,classifications))
    accs1$accuracy[i] = CM$overall[1]
    accs1$specificity[i] = CM$byClass[2]
    accs1$sensitivity[i] = CM$byClass[1]
    accs1$seed[i] = i}, 
    error = function(e) {
    accs1$accuracy[i] = 0
    accs1$specificity[i] = 0
    accs1$sensitivity[i] = 0
    accs1$seed[i] = 0
    })
  #CM <- confusionMatrix(table(test$Attrition,classifications))
}
accs1
##     accuracy k specificity sensitivity seed
## 1  0.8448276 0   0.6428571   0.8625000    1
## 2  0.8563218 0   0.5714286   0.8682635    2
## 3  0.8045977 0   0.5000000   0.8271605    3
## 4  0.8735632 0   0.2857143   0.8982036    4
## 5  0.8620690 0   0.8000000   0.8658537    5
## 6  0.8793103 0   0.7058824   0.8980892    6
## 7  0.8563218 0   0.4375000   0.8987342    7
## 8  0.8793103 0   0.8333333   0.8827160    8
## 9  0.8448276 0   0.5714286   0.8687500    9
## 10 0.8505747 0   0.5000000   0.8860759   10
## 11 0.7988506 0   0.4375000   0.8354430   11
## 12 0.8965517 0   0.6666667   0.9182390   12
## 13 0.8678161 0   0.6000000   0.8841463   13
## 14 0.8850575 0   0.8181818   0.8895706   14
## 15 0.8103448 0   0.5294118   0.8407643   15
## 16 0.8505747 0   0.6666667   0.8606061   16
## 17 0.8563218 0   0.6923077   0.8695652   17
## 18 0.8505747 0   0.5294118   0.8853503   18
## 19 0.8908046 0   0.7142857   0.8982036   19
## 20 0.8735632 0   0.6363636   0.8895706   20
summary(accs1)
##     accuracy            k      specificity      sensitivity          seed      
##  Min.   :0.7989   Min.   :0   Min.   :0.2857   Min.   :0.8272   Min.   : 1.00  
##  1st Qu.:0.8491   1st Qu.:0   1st Qu.:0.5221   1st Qu.:0.8650   1st Qu.: 5.75  
##  Median :0.8563   Median :0   Median :0.6182   Median :0.8834   Median :10.50  
##  Mean   :0.8566   Mean   :0   Mean   :0.6069   Mean   :0.8764   Mean   :10.50  
##  3rd Qu.:0.8750   3rd Qu.:0   3rd Qu.:0.6957   3rd Qu.:0.8917   3rd Qu.:15.25  
##  Max.   :0.8966   Max.   :0   Max.   :0.8333   Max.   :0.9182   Max.   :20.00
ggplot(data = data, aes(x=JobRole, y=MonthlyIncome, color=Attrition)) + geom_point(position="jitter") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(data = data, aes(x=OverTime, y=MonthlyIncome, z=JobRole, color=Attrition)) + geom_point(position="jitter")

Data Conclusions


Attrition

Attrition is defined as reducing/preventing voluntary employee turnover. Here, if the value is no, this means that the employee was not lost, and yes means that the company did lose the employee.

Factors that Lead to Attrition

While many factors were considered to lead to attrition, it was determined that the top 3 factors are 1. Monthly Income 2. Over Time 3. Job Role

Model for Attrition Prediction

The final model takes into account all three factors, with a mean specificity of 0.61 and sensitivity of 0.87 at K-7.

# select desired columns from the data sets 
training_set <- train %>% select(c("OverTime", "JobRole", "MonthlyIncome"))
testing_set <- test %>% select(c("OverTime", "JobRole", "MonthlyIncome"))

#since the data needs to be dummy coded for categorical, the following is more accurate
training <- train[,c(32,33,15,20,21,22,23,24,25,26,27,28)]
testing<- test[,c(32,33,15,20,21,22,23,24,25,26,27,28)]

# run the data
CM <- knn(training,testing,train$Attrition, prob = TRUE, k = 7)

# save the column to the desired dataset
testing$prediction<- CM
Predicting Attrition

At minimum, the attrition should be at 60% and the sensitivity should be 60%. Aftering running the model to recieve predicted attrition values, the dataset was ordered by id and exported to a csv.

# load in the attrition data
attrition_competition = read.csv("../data_sets/CaseStudy2CompSet No Attrition.csv", header=TRUE)


# order by id
attach(attrition_competition)
attrition_competition_ordered <- attrition_competition[order(ID),]

# run predictions on the competition model

# 1. dummy code data
need_dummycodes <- c("JobRole", "OverTime")
for (val in need_dummycodes) {
  temp <- as.data.frame(dummy.code(attrition_competition_ordered[[val]]))
  attrition_competition_ordered <- cbind(attrition_competition_ordered, temp)
}

# 2. Scale the monthly income
attrition_competition_ordered$MonthlyIncomeScaled <- scale(attrition_competition_ordered$MonthlyIncome)

# 3. pull the train and validation sets
training <- train[,c(32,33,15,20,21,22,23,24,25,26,27,28)]
testing <- attrition_competition_ordered %>% select(c("Research Scientist","Sales Executive","Laboratory Technician","Manufacturing Director","Manager","Healthcare Representative","Research Director","Sales Representative","Human Resources","No","Yes","MonthlyIncomeScaled"))

# 4. Run the model to get predictions
CM <- knn(training,testing,train$Attrition, prob = TRUE, k = 7)

# 5. save the column to the desired dataset
attrition_competition_ordered$Attrition<- CM

# 6. filter for desired columns
output_attrition <- attrition_competition_ordered %>% select(c("ID", "Attrition"))
# export the attrition predicted data
write.csv(output_attrition, "../data_sets/output_data_sets/Case2Predictions_JamieVo Attrition.csv")

Monthly Incomes

Model for Monthly Incomes

  1. First scatter plots are run to determine the relationship and predictors for the monthly income. This is also tested by using the initial data set. From the first model, its seen that
  • intitially, there appears to be a linear relationship between total working years and monthly income.
  1. running a stepwise linear regression to narrow down the factors, the following were found to be significant. After running a stepwise regression model, its noted that the factors that most attribute to Monly income is
  • Business Travel
  • Performance rating
  • Percent Salary Hike
  • Years Since Last promotion
  • Gender (Male)
  • Job Level
  • Total Years Working
  • Years with current Manager
  • Daily Rate
  • Distance from Home
  • Job Role
  1. The final model that was the simplest contains JobLevel + TotalWorkingYears + Travel_Rarel.
# 1. dummy code all the data 
dummy_code2 <- c("BusinessTravel","Department","EducationField","Gender","JobRole","MaritalStatus","OverTime")
for (val in dummy_code2) {
  temp <- as.data.frame(dummy.code(data[[val]]))
  data <- cbind(data, temp)
}

data$attrition_no <- ifelse(data$Attrition =="No", 1, 0) #dummy code attrition

names(data)<-make.names(names(data),unique = TRUE) # remove spaces from the name

# 2. initial reduction of factors

scatter.smooth(x=data$TotalWorkingYears, y=data$MonthlyIncome, main="") 

lm_step1 <- step(lm(MonthlyIncome~Age
 + DailyRate + DistanceFromHome + Education + EmployeeNumber
 + EnvironmentSatisfaction + HourlyRate + JobInvolvement + JobLevel
 + JobSatisfaction + MonthlyRate + NumCompaniesWorked
 + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StandardHours
 + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance
 + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager
 + Travel_Rarely + Travel_Frequently
 + Non.Travel + Research...Development + Sales + Human.Resources
 + Life.Sciences + Medical + Marketing + Technical.Degree
 + Other + Human.Resources + Male + Female
 + Sales.Executive + Research.Scientist + Laboratory.Technician + Manufacturing.Director
 + Healthcare.Representative + Sales.Representative + Manager + Research.Director
 + Human.Resources + Married + Single + Divorced
 + No + Yes + attrition_no, data = data),direction="both")

An interesting note is that Males are in the model, while females are not. We find that the sex of the candidate affects the model.

# 3. customizing model and checking all assumptions. Here, the switch 

# split data into train adn test set
  set.seed(8)
  splitPerc = .80
  
  
  #split data
  trainIndices = sample(1:dim(data)[1],round(splitPerc * dim(data)[1]))
  train = data[trainIndices,]
  test = data[-trainIndices,]
# Add remove any rates considering they likely feed into the monthly income, this is not helpful
lm_back <- step(lm( MonthlyIncome ~  DistanceFromHome + 
    JobLevel  + PercentSalaryHike + PerformanceRating + 
    TotalWorkingYears + YearsSinceLastPromotion + YearsWithCurrManager + 
    Travel_Rarely + Travel_Frequently + Male + Sales.Executive + 
    Laboratory.Technician + Manufacturing.Director + Healthcare.Representative + 
    Manager + Research.Director, data = train), direction = "backward")

lm_forward <- step(lm(MonthlyIncome ~ JobLevel + TotalWorkingYears + YearsSinceLastPromotion + 
    YearsWithCurrManager + Travel_Rarely + Male + Sales.Executive + 
    Laboratory.Technician + Manufacturing.Director + Healthcare.Representative + 
    Manager + Research.Director, data = train), direction = "forward")
# run the linear regression
linearMod <- lm(MonthlyIncome ~ JobLevel + TotalWorkingYears + Travel_Rarely, data=train)  # build linear regression model on full data
print(linearMod)
## 
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + TotalWorkingYears + Travel_Rarely, 
##     data = train)
## 
## Coefficients:
##       (Intercept)           JobLevel  TotalWorkingYears      Travel_Rarely  
##          -2029.79            3704.57              56.68             303.75
summary(linearMod) 
## 
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + TotalWorkingYears + Travel_Rarely, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5310.4  -846.6    24.4   773.0  3805.5 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2029.79     133.56 -15.198  < 2e-16 ***
## JobLevel           3704.57      74.99  49.400  < 2e-16 ***
## TotalWorkingYears    56.68      10.90   5.202 2.59e-07 ***
## Travel_Rarely       303.75     113.10   2.686  0.00741 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1348 on 692 degrees of freedom
## Multiple R-squared:  0.9149, Adjusted R-squared:  0.9145 
## F-statistic:  2478 on 3 and 692 DF,  p-value: < 2.2e-16

Assumptions for linear regression

  1. Normality
  2. Constant Variance
  3. Linearity
  4. Outliers

Overall, it can be argued that no major deviations from the assumptions and the linear regression model has a good fit.

# 1, Normality
sresid <- studres(linearMod)
hist(sresid, freq=FALSE,
   main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)

# 2. Constant Variance

plot(sresid) + title("Rediduals for Constant Variance")

## integer(0)
# 3. Linearity
ggPredict(linearMod,se=TRUE,interactive=TRUE)
## Warning: package 'gdtools' was built under R version 3.6.2
# 4. Outliers 
plot(linearMod)

Full Statistics

# prediction of the model
require(graphics)
pred <- predict(linearMod, test)


actuals_preds <- data.frame(cbind(actuals=test$MonthlyIncome, predicteds=pred))  # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds)  # 94.38%
head(actuals_preds)
##    actuals predicteds
## 2    19626  17987.171
## 10    5063   5832.822
## 14    2476   2035.214
## 15    3423   2545.369
## 22    8120   6363.307
## 24    5332   5946.190
summary(linearMod)
## 
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + TotalWorkingYears + Travel_Rarely, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5310.4  -846.6    24.4   773.0  3805.5 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2029.79     133.56 -15.198  < 2e-16 ***
## JobLevel           3704.57      74.99  49.400  < 2e-16 ***
## TotalWorkingYears    56.68      10.90   5.202 2.59e-07 ***
## Travel_Rarely       303.75     113.10   2.686  0.00741 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1348 on 692 degrees of freedom
## Multiple R-squared:  0.9149, Adjusted R-squared:  0.9145 
## F-statistic:  2478 on 3 and 692 DF,  p-value: < 2.2e-16
# Calculate RMSE
RSS <- c(crossprod(linearMod$residuals))
MSE <- RSS / length(linearMod$residuals)
RMSE <- sqrt(MSE)
Predicting Monthly Incomes

The model attains a RMSE < $3000 for the training and the validation set.

# load in the monthly income data
monthly_income_competition = read.csv("../data_sets/CaseStudy2CompSet No Salary.csv", header=TRUE)


# order by id
attach(attrition_competition)
## The following objects are masked from attrition_competition (pos = 4):
## 
##     Age, BusinessTravel, DailyRate, Department, DistanceFromHome,
##     Education, EducationField, EmployeeCount, EmployeeNumber,
##     EnvironmentSatisfaction, Gender, HourlyRate, ID, JobInvolvement,
##     JobLevel, JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome,
##     MonthlyRate, NumCompaniesWorked, Over18, OverTime,
##     PercentSalaryHike, PerformanceRating, RelationshipSatisfaction,
##     StandardHours, StockOptionLevel, TotalWorkingYears,
##     TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany,
##     YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager
monthly_income_competition_ordered <- monthly_income_competition[order(ID),]

# transform the data
# 1. dummy code all the data 
dummy_code2 <- c("BusinessTravel","Department","EducationField","Gender","JobRole","MaritalStatus","OverTime")
for (val in dummy_code2) {
  temp <- as.data.frame(dummy.code(monthly_income_competition_ordered[[val]]))
  monthly_income_competition_ordered <- cbind(monthly_income_competition_ordered, temp)
}

# predict the values
pred <- predict(linearMod, monthly_income_competition_ordered)

monthly_income_competition_ordered$MonthlyIncome = pred

output1<- as.data.frame(cbind(ID=monthly_income_competition_ordered$ID, MonthlyIncome=monthly_income_competition_ordered$MonthlyIncome))

# export the attrition predicted data
write.csv(output1, "../data_sets/output_data_sets/Case2Predictions_JamieVo Salary.csv")

Conclusion

In conclusion, it was determined that the three factors that provide high specificity and sensitivity to attrition are monhtly salary, overtime, and job role. Those who worked over time tended to have higher attrition rates, along with those in the lower brackets of monthly oncome. As for job roles, it can be seen in the graphs above that sales reps have high attrition rate. The KNN model designed looked a the nearest 7 neighbors to determine the attrition classification, resulting in ~60% sensitivity and ~87% specificity.

The most important predictors for monthly income is whether the employee rarely travels, their job level and their total years working. This is intuitive, but interesting to see that there essentially wasn’t a major emphasis on role. The multiple linear regression model showed that roughly 91% of the montly income could be explained by the variables included, with an RMSE of ~$1400.00.

For job specific roles, there were not major trends noticed in the initial analysis. An interesting point is that Research Directors and Managers tend to have worked a greater number of total years and monthly income. The cluster of sales and technicians towards lower work years and monthly income would hopefully be explained by their movement up into management.